GH-47995: [C++][Parquet] Fix empty string min/max statistics being lost during merge#48717
GH-47995: [C++][Parquet] Fix empty string min/max statistics being lost during merge#48717rynewang wants to merge 3 commits intoapache:mainfrom
Conversation
…ing lost during merge Prior to this change, CleanStatistic() for ByteArray rejected statistics where ptr == nullptr. However, empty strings can have ptr == nullptr with len == 0, causing valid statistics to be discarded when the minimum value is an empty string. The fix introduces a sentinel pointer (kNoValueSentinel) distinct from nullptr to mark "no value" in ByteArray statistics. This allows CleanStatistic to distinguish between "no min/max computed" (sentinel) and "min/max is empty string" (nullptr with len=0). FLBA is unchanged since it has fixed length and no "empty" concept.
|
|
| static T DefaultMin() { return T{0, kNoValueSentinel}; } | ||
| static T DefaultMax() { return T{0, kNoValueSentinel}; } |
There was a problem hiding this comment.
I think these could just return ByteArray("") (or ByteArray(std::string_view("")) if that doesn't work?)
| using Base = BinaryLikeCompareHelperBase<ByteArrayType, is_signed>; | ||
| using T = ByteArray; | ||
|
|
||
| // Use kNoValueSentinel instead of nullptr to distinguish "no value" from empty string. |
There was a problem hiding this comment.
This is rather confusing, why not do the reverse? i.e. have nullptr mean a missing statistic, while a non-null empty string would mean an empty statistic.
|
@pitrou Thanks for the review! After investigation, I found that the sentinel approach is needed because: The bug is manifestable through the public ByteArray empty{0, nullptr};
writer->WriteBatch(n, nullptr, nullptr, &empty);The empty string statistics are lost because Why other paths work:
Why I've added tests using the public
|
Tests that empty strings represented as ByteArray{0, nullptr} are correctly
handled in statistics when using the public WriteBatch API.
- EmptyStringWithNullptrStats: verifies min is empty string, not skipped
- AllEmptyStringsWithNullptrStats: verifies HasMinMax is true
- EmptyStringWithNullptrInRawByteArray: unit test for Update/Merge
|
I agree that sentinel is better as it never conflict with user input. |
The tests now use ByteArray("") which has a valid non-null pointer,
avoiding undefined behavior in PlainEncoder when encoding nullptr.
Also fixes clang-format issues with long assertion lines.
Why is this correct? I'm not sure it is ok to pass a nullptr ByteArray to WriteBatch. Also, if you pass |
Rationale for this change
Fixes #47995
When merging ByteArray statistics, empty string min/max values were incorrectly discarded. This happened because
CleanStatistic()rejected statistics whereptr == nullptr, but empty strings can legitimately haveptr == nullptrwithlen == 0.What changes are included in this PR?
Introduces a sentinel pointer (
kNoValueSentinel) distinct fromnullptrto mark "no value" in ByteArray statistics. This allowsCleanStatisticto distinguish between:FLBA is unchanged since it has fixed length and no "empty" concept.
Are these changes tested?
Yes. Added comprehensive tests covering all combinations of:
Are there any user-facing changes?
No API changes. This is a bug fix that preserves empty string statistics correctly during merge operations.